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Abstract. In this paper we discuss new adaptive proposal strategies for sequential Monte 
Carlo algorithms — also known as particle filters — relying on criteria evaluating the quality 
of the proposed particles. The choice of the proposal distribution is a major concern and 
can dramatically influence the quality of the estimates. Thus, we show how the long-used 
coefficient of variation (suggested by Kong et al. (1994)) of the weights can be used for 
estimating the chi-square distance between the target and instrumental distributions of the 
auxiliary particle filter. As a by-product of this analysis we obtain an auxiliary adjustment 
multiplier weight type for which this chi-square distance is minimal. Moreover, we establish 
an empirical estimate of linear complexity of the Kullback-Leibler divergence between the 
involved distributions. Guided by these results, we discuss adaptive designing of the particle 
filter proposal distribution and illustrate the methods on a numerical example. 



1. Introduction 

Easing the role of the user by tuning automatically the key parameters of sequential Monte 
Carlo (SMC) algorithms has been a long-standing topic in the community, notably through 
adaptation of the particle sample size or the way the particles are sampled and weighted. 
In this paper we focus on the latter issue and develop methods for adjusting adaptively the 
proposal distribution of the particle filter. 

Adaptation of the number of particles has been treated by several authors. In Legland 
and Oudjane (2006) (and later Hu et al. (2008, Section IV)) the size of the particle sample 
is increased until the total weight mass reaches a positive threshold, avoiding a situation 
where all particles are located in regions of the state space having zero posterior probability. 
Fearnhead and Liu (2007, Section 3.2) adjust the size of the particle cloud in order to control 
the error introduced by the resampling step. Another approach, suggested by Fox (2003) 
and refined in Soto (2005) and Straka and Simandl (2006), consists in increasing the sample 
size until the Kullback-Leibler divergence (KLD) between the true and estimated target 
distributions is below a given threshold. 
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Unarguably, setting an appropriate sample size is a key ingredient of any statistical esti- 
mation procedure, and there are cases where the methods mentioned above may be used for 
designing satisfactorily this size; however increasing the sample size only is far from being 
always sufficient for achieving efficient variance reduction. Indeed, as in any algorithm based 
on importance sampling, a significant discrepancy between the proposal and target distri- 
butions may require an unreasonably large number of samples for decreasing the variance 
of the estimate under a specified value. For a very simple illustration, consider importance 
sampling estimation of the mean m of a normal distribution using as importance distribution 
another normal distribution having zero mean and same variance: in this case, the variance 
of the estimate grows like exp(m 2 )/N, N denoting the number of draws, implying that the 
sample size required for ensuring a given variance grows exponentially fast with m. 

This points to the need for adapting the importance distribution of the particle filter, 
e.g., by adjusting at each iteration the particle weights and the proposal distributions; see 
e.g. Doucet et al. (2000), Liu (2004), and Fearnhead (2008) for reviews of various filtering 
methods. These two quantities are critically important, since the performance of the particle 
filter is closely related to the ability of proposing particles in state space regions where the 
posterior is significant. It is well known that sampling using as proposal distribution the 
mixture composed by the current particle importance weights and the prior kernel (yielding 
the classical bootstrap particle filter of Gordon et al. (1993)) is usually inefficient when the 
likelihood is highly peaked or located in the tail of the prior. 

In the sequential context, the successive distributions to be approximated (e.g. the succes- 
sive filtering distributions) are the iterates of a nonlinear random mapping, defined on the 
space of probability measures; this nonlinear mapping may in general be decomposed into 
two steps: a prediction step which is linear and a nonlinear correction step which amounts 
to compute a normalisation factor. In this setting, an appealing way to update the current 
particle approximation consists in sampling new particles from the distribution obtained by 
propagating the current particle approximation through this mapping; see e.g. Hiirzeler and 
Kiinsch (1998), Doucet et al. (2000), and Kiinsch (2005) (and the references therein). This 
sampling distribution guarantees that the conditional variance of the importance weights 
is equal to zero. As we shall see below, this proposal distribution enjoys other optimality 
conditions, and is in the sequel referred to as the optimal sampling distribution. However, 
sampling from the optimal sampling distribution is, except for some specific models, a dif- 
ficult and time-consuming task (the in general costly auxiliary accept-reject developed and 
analysed by Kiinsch (2005) being most often the only available option). 

To circumvent this difficulty, several sub-optimal schemes have been proposed. A first 
type of approaches tries to mimic the behavior of the optimal sampling without suffering the 
sometimes prohibitive cost of rejection sampling. This typically involves localisation of the 
modes of the unnormalised optimal sampling distribution by means of some optimisation 
algorithm, and the fitting of over-dispersed student's t-distributions around these modes; see 
for example Shephard and Pitt (1997), Doucet et al. (2001), and Liu (2004) (and the refer- 
ences therein). Except in specific cases, locating the modes involves solving an optimisation 
problem for every particle, which is quite time-consuming. 
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A second class of approaches consists in using some classical approximate non-linear fil- 
tering tools such as the extended Kalman filter (EKF) or the unscented transform Kalman 
filter (UT/UKF); see for example Doucet et al. (2001) and the references therein. These 
techniques assume implicitly that the conditional distribution of the next state given the 
current state and the observation has a single mode. In the EKF version of the particle 
filter, the linearisation of the state and observation equations is carried out for each indi- 
vidual particle. Instead of linearising the state and observation dynamics using Jacobian 
matrices, the UT/UKF particle filter uses a deterministic sampling strategy to capture the 
mean and covariance with a small set of carefully selected points (sigma points), which is 
also computed for each particle. Since these computations are most often rather involved, a 
significant computational overhead is introduced. 

A third class of techniques is the so-called auxiliary particle filter (APF) suggested by Pitt 
and Shephard (1999), who proposed it as a way to build data-driven proposal distributions 
(with the initial aim of robustifying standard SMC methods to the presence of outlying 
observations); see e.g. Fearnhead (2008). The procedure comprises two stages: in the first- 
stage, the current particle weights are modified in order to select preferentially those particles 
being most likely proposed in regions where the posterior is significant. Usually this amounts 
to multiply the weights with so-called adjustment multiplier weights, which may depend on 
the next observation as well as the current position of the particle and (possibly) the proposal 
transition kernels. Most often, this adjustment weight is chosen to estimate the predictive 
likelihood of the next observation given the current particle position, but this choice is not 
necessarily optimal. 

In a second stage, a new particle sample from the target distribution is formed using this 
proposal distribution and associating the proposed particles with weights proportional to 
the inverse of the adjustment multiplier weight \ APF procedures are known to be rather 
successful when the first-stage distribution is appropriately chosen, which is not always 
straightforward. The additional computational cost depends mainly on the way the first- 
stage proposal is designed. The APF method can be mixed with EKF and UKF leading to 
powerful but computationally involved particle filter; see, e.g., Andrieu et al. (2003). 

None of the suboptimal methods mentioned above minimise any sensible risk-theoretic 
criterion and, more annoyingly, both theoretical and practical evidences show that choices 
which seem to be intuitively correct may lead to performances even worse than that of 
the plain bootstrap filter (see for example Douc et al. (2008) for a striking example). The 
situation is even more unsatisfactory when the particle filter is driven by a state space 
dynamic different from that generating the observations, as happens frequently when, e.g., 
the parameters are not known and need to be estimated or when the model is misspecified. 

Instead of trying to guess what a good proposal distribution should be, it seems sensible 
to follow a more risk-theoretically founded approach. The first step in such a construction 

1 The original APF proposed by Pitt and Shephard (1999) features a second resampling procedure in order 
to end-up with an equally weighted particle system. This resampling procedure might however severely reduce 
the accuracy of the filter: Carpenter et al. (1999) give an example where the accuracy is reduced by a factor 
of 2; see also Douc et al. (2008) for a theoretical proof. 
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consists in choosing a sensible risk criterion, which is not a straightforward task in the SMC 
context. A natural criterion for SMC would be the variance of the estimate of the posterior 
mean of a target function (or a set of target functions) of interest, but this approach does 
not lead to a practical implementation for two reasons. Firstly, in SMC methods, though 
closed-form expression for the variance at any given current time-step of the posterior mean 
of any function is available, this variance depends explicitly on all the time steps before the 
current time. Hence, choosing to minimise the variance at a given time-step would require to 
optimise all the simulations up to that particular time step, which is of course not practical. 
Because of the recursive form of the variance, the minimisation of the conditional variance at 
each iteration of the algorithm does not necessarily lead to satisfactory performance on the 
long-run. Secondly, as for the standard importance sampling algorithm, this criterion is not 
function-free, meaning that a choice of a proposal can be appropriate for a given function, 
but inappropriate for another. 

We will focus in the sequel on function-free risk criteria. A first criterion, advocated in 
Kong et al. (1994) and Liu (2004) is the chi-square distance (CSD) between the proposal 
and the target distributions, which coincides with the coefficient of variation (CV 2 ) of the 
importance weights. In addition, as heuristically discussed in Kong et al. (1994), the CSD is 
related to the effective sample size, which estimates the number of i.i.d. samples equivalent 
to the weighted particle system 2 . In practice, the CSD criterion can be estimated, with a 
complexity that grows linearly with the number of particles, using the empirical CV 2 which 
can be shown to converge to the CSD as the number of particles tends to infinity. In this 
paper we show that a similar property still holds in the SMC context, in the sense that the 
CV 2 still measures a CSD between two distributions /i* and n*, which are associated with 
the proposal and target distributions of the particle filter (see Theorem 4.1(h)). Though 
this result does not come as a surprise, it provides an additional theoretical footing to an 
approach which is currently used in practice for triggering resampling steps. 

Another function-free risk criterion to assess the performance of importance sampling 
estimators is the KLD between the proposal and the target distributions; see (Cappe et al., 
2005, Chapter 7). The KLD shares some of the attractive properties of the CSD; in particular, 
the KLD may be estimated using the negated empirical entropy S of the importance weights, 
whose computational complexity is again linear in the number of particles. In the SMC 
context, it is shown in Theorem 4.1(i) that £ still converges to the KLD between the same 
two distributions fi* and n* associated with the proposal and the target distributions of the 
particle filter. 

Our methodology to design appropriate proposal distributions is based upon the minimi- 
sation of the CSD and KLD between the proposal and the target distributions. Whereas 
these quantities (and especially the CSD) have been routinely used to detect sample impov- 
erishment and trigger the resampling step (Kong et al., 1994), they have not been used for 
adapting the simulation parameters in SMC methods. 



In some situations, the estimated ESS value can be misleading: see the comments of Stephens and 
Donnelly (2000) for a further discussion of this. 
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We focus here on the auxiliary sampling formulation of the particle filter. In this setting, 
there are two quantities to optimise: the adjustment multiplier weights (also called first- 
stage weights) and the parameters of the proposal kernel; together these quantites define 
the mixture used as instrumental distribution in the filter. We first establish a closed- 
form expression for the limiting value of the CSD and KLD of the auxiliary formulation of 
the proposal and the target distributions. Using these expressions, we identify a type of 
auxiliary SMC adjustment multiplier weights which minimise the CSD and the KLD for a 
given proposal kernel (Proposition 4.2). We then propose several optimisation techniques 
for adapting the proposal kernels, always driven by the objective of minimising the CSD 
or the KLD, in coherence with what is done to detect sample impoverishment (see Section 
5). Finally, in the implementation section (Section 6), we use the proposed algorithms for 
approximating the filtering distributions in several state space models, and show that the 
proposed optimisation procedure improves the accuracy of the particle estimates and makes 
them more robust to outlying observations. 



2.1. Adaptive importance sampling. Before stating and proving rigorously the main 
results, we discuss informally our findings and introduce the proposed methodology for de- 
veloping adaptive SMC algorithms. Before entering into the sophistication of sequential 
methods, we first briefly introduce adaptation of the standard (non-sequential) importance 
sampling algorithm. 

Importance sampling (IS) is a general technique to compute expectations of functions 
with respect to a target distribution with density p(x) while only having samples generated 
from a different distribution — referred to as the proposal distribution — with density q(x) 
(implicitly, the dominating measure is taken to be the Lebesgue measure on S = M d ). We 
sample {£i}i=i from the proposal distribution q and compute the unnormalised importance 
weights uJi = W(£i), i = 1, . . ., N, where W(x) = p(x)/q(x). For any function /, the self- 
normalised importance sampling estimator may be expressed as ISjv(/) — ^jy 1 ^iLi^ifid), 
where f2jy — J2j=i ^j- As usual in applications of the IS methodology to Bayesian inference, 
the target density p is known only up to a normalisation constant; hence we will focus only 
on a self-normalised version of IS that solely requires the availability of an unnormalised 
version of p (see Geweke, 1989). Throughout the paper, we call a set {£i}i=i of random 
variables, referred to as particles and taking values in S, and nonnegative weights {cUj}^ a 
weighted sample on H. Here N is a (possibly random) integer, though we will take it fixed in 
the sequel. It is well known (see again Geweke, 1989) that, provided that / is integrable with 
respect to p, i.e. J \f(x)\p(x) dx < oo, ISjv(/) converges, as the number of samples tends to 
infinity, to the target value 



2. Informal presentation of the results 
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for any function / G C, where C is the set of functions which are integrable with respect to 
to the target distribution p. Under some additional technical conditions, th is estimator is 
also asymptotically normal at rate yN; see Geweke (1989). 

It is well known that IS estimators are sensitive to the choice of the proposal distribution. 
A classical approach consists in trying to minimise the asymptotic variance with respect 
to the proposal distribution q. This optimisation is in closed form and leads (when / is 
a non-negative function) to the optimal choice q*(x) = f(x)p(x)/ J f(x)p(x) dx, which is, 
since the normalisation constant is precisely the quantity of interest, rather impractical. 
Sampling from this distribution can be done by using an accept-reject algorithm, but this 
does not solve the problem of choosing an appropriate proposal distribution. Note that it is 
possible to approach this optimal sampling distribution by using the cross- entropy method; 
see Rubinstein and Kroese (2004) and de Boer et al. (2005) and the references therein. We 
will discuss this point later on. 

For reasons that will become clear in the sequel, this type of objective is impractical in the 
sequential context, since the expression of the asymptotic variance in this case is recursive 
and the optimisation of the variance at a given step is impossible. In addition, in most 
applications, the proposal density is expected to perform well for a range of typical functions 
of interest rather than for a specific target function /. We are thus looking for function-free 
criteria. The most often used criterion is the CSD between the proposal distribution q and 
the target distribution p, defined as 

=/"" (I) " I)dI - 1 ' (2 ' 2) 

= J W{x)p{x) dx - 1 . (2.3) 

The CSD between p and q may be expressed as the variance of the importance weight 
function W under the proposal distribution, i.e. 

d x 2{p\\q)=VM q [W(X)}. 

This quantity can be estimated by computing the squared coefficient of variation of the 
unnormalized weights (Evans and Swartz, 1995, Section 4): 

N 

CV^WD^^^^-l. (2.4) 

i=i 

The CV 2 was suggested by Kong et al. (1994) as a means for detecting weight degeneracy. 
If all the weights are equal, then CV 2 is equal to zero. On the other hand, if all the weights 
but one are zero, then the coefficient of variation is equal to iV — 1 which is its maximum 
value. From this it follows that using the estimated coefficient of variation for assessing 
accuracy is equivalent to examining the normalised importance weights to determine if any 
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are relatively large 3 . Kong et al. (1994) showed that the coefficient of variation of the weights 



CV 2 ({ui}f =1 ) is related to the effective sample size (ESS), which is used for measuring the 
overall efficiency of an IS algorithm: 

a^ess^}^) 4 1 - {i + dAvMY 1 ■ 

Heuristically, the ESS measures the number of i.i.d. samples (from p) equivalent to the N 
weighted samples. The smaller the CSD between the proposal and target distributions is, 
the larger is the ESS. This is why the CSD is of particular interest when measuring efficiency 
of IS algorithms. 

Another possible measure of fit of the proposal distribution is the KLD (also called relative 
entropy) between the proposal and target distributions, defined as 

dsL(p\\q)= fp(x) log (441 (2-5) 



q{x) 

p(x) log W{x) dx , (2.6) 

W(x)\ogW(x)q(x)dx . (2.7) 

This criterion can be estimated from the importance weights using the negative Shannon 
entropy S of the importance weights: 

N 

= n^X) w < lo g(^ lw • ( 2 - 8 ) 

i=l 

The Shannon entropy is maximal when all the weights are equal and minimal when all 
weights are zero but one. In IS (and especially for the estimation of rare events), the KLD 
between the proposal and target distributions was thoroughly investigated by Rubinstein 
and Kroese (2004), and is central in the cross-entropy (CE) methodology. 

Classically, the proposal is chosen from a family of densities qe parameterised by 6. Here 
9 should be thought of as an element of G, which is a subset of M fe . The most classical 
example is the family of student's t-distributions parameterised by mean and covariance. 
More sophisticated parameterisations, like mixture of multi-dimensional Gaussian or Stu- 
dent's t-distributions, have been proposed; see, e.g., Oh and Berger (1992), Oh and Berger 
(1993), Evans and Swartz (1995), Givens and Raftery (1996), Liu (2004, Chapter 2, Section 
2.6), and, more recently, Cappe et al. (2008) in this issue. In the sequential context, where 
computational efficiency is a must, we typically use rather simple parameterisations, so that 
the two criteria above can be (approximatively) solved in a few iterations of a numerical 
minimisation procedure. 



3 Some care should be taken for small sample sizes N; the CV 2 can be low because q sample only over a 
subregion where the integrand is nearly constant, which is not always easy to detect. 
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The optimal parameters for the CSD and the KLD are those minimising 9 \— > d x 2(p\\qg) 
and 9 h- > d KL (p\ \q e ), respectively. In the sequel, we denote by 0£ SD and 6^ LD t nese optimal 
values. Of course, these quantities cannot be computed in closed form (recall that even 
the normalisation constant of p is most often unknown; even if it is known, the evaluation 
of these quantities would involve the evaluation of most often high-dimensional integrals). 
Nevertheless, it is possible to construct consistent estimators of these optimal parameters. 
There are two classes of methods, detailed below. 

The first uses the fact that the the CSD d x 2{p\\q e ) and the KLD g?kl(p|<?0) m ay be ap- 
proximated by (2.4) and (2.8), substituting in these expressions the importance weights by 
&i = W$(£i), i — 1, • • • , N, where Wg = p/qg and {Ci}fLi is a sample from qg. This optimi- 
sation problem formally shares some similarities with the classical minimum chi-square or 
maximum likelihood estimation, but with the following important difference: the integra- 
tions in (2.1) and (2.5) are with respect to the proposal distribution qg and not the target 
distribution p. As a consequence, the particles {£f }£Li in the definition of the coefficient of 
variation (2.4) or the entropy (2.8) of the weights constitute a sample from qg and not from 
the target distribution p. As the estimation progresses, the samples used to approach the 
limiting CSD or KLD can, in contrast to standard estimation procedures, be updated (these 
samples could be kept fixed, but this is of course inefficient). 

The computational complexity of these optimisation problems depends on the way the 
proposal is parameterised and how the optimisation procedure is implemented. Though the 
details of the optimisation procedure is in general strongly model dependent, some common 
principles for solving this optimisation problem can be outlined. Typically, the optimisation 
is done recursively, i.e. the algorithm defines a sequence 9e, £ = 0, 1, ... , of parameters, 
where £ is the iteration number. At each iteration, the value of 9t is updated by computing 
a direction p£ + i in which to step, a step length je+i, and setting 

@e+i = @e + le+iPe+i ■ 

The search direction is typically computed using either Monte Carlo approximation of the 
finite-difference or (when the quantities of interest are sufficiently regular) the gradient of the 
criterion. These quantities are used later in conjunction with classical optimisation strategies 
for computing the step size 7^+1 or normalising the search direction. These implementation 
issues, detailed in Section 6, are model dependent. We denote by Mi the number of particles 
used to obtain such an approximation at iteration £. The number of particles may vary with 
the iteration index; heuristically there is no need for using a large number of simulations 
during the initial stage of the optimisation. Even rather crude estimation of the search 
direction might suffice to drive the parameters towards the region of interest. However, as the 
iterations go on, the number of simulations should be increased to avoid "zi g-zagging" when 
the algorithm approaches convergence. After L iterations, the total number of generated 
particles is equal to N = Yle=i M(. Another solution, which is not considered in this paper, 
would be to use a stochastic approximation procedure, which consists in fixing Mi = M and 
letting the stepsize 7^ tend to zero. This appealing solution has been successfully used in 
Arouna (2004). 
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The computation of the finite difference or the gradient, being defined as expectations of 
functions depending on 9, can be performed using two different approaches. Starting from 
definitions (2.3) and (2.6), and assuming appropriate regularity conditions, the gradient of 
9 i — > d x 2{p\\q e ) and 9 \— > c?kl (p| |<?e) may be expressed as 

G C SB(0)=V e d x 2(p\\q ) = J p{x)V e W e {x)dx = J qg(x)W (x)V e W 8 (x) dx , (2.9) 

Gkld(0) = V e d KL (p\\q ) = j p(x)V e \og[We(x)]dx = j q e (x)V e W e (x) dx . (2.10) 
These expressions lead immediately to the following approximations, 

M 

G C sb(0) = M" 1 W e (&)V e W 9 (&) , (2.11) 



i=l 
M 



G K ld(9) = M- 1 VoW 0l (g') . (2.12) 



i=l 



There is another way to compute derivatives, which shares some similarities with pathwise 
derivative estimates. Recall that for any 9 G O, one may choose Fg so that the random 
variable Fg(e), where e is a vector of independent uniform random variables on [0, l] d , is 
distributed according to q . Therefore, we may express 9 d x 2{p\\q e ) and 9 \— > d KL (p\ \q e ) 
as the following integrals, 



d x 2(p\\qe) = / w e (x) dx 
'{0,l] d 



dKh(p\\qe) = / w e (x) log [w e {x)\ dx , 

J[0,l] d 

where wg(x) = Wg o F e (x). Assuming appropriate regularity conditions (i.e. that 9 \— > Wq o 
Fg(x) is different iable and that we can interchange the integration and the differentiation), 
the differential of these quantities with respect to 9 may be expressed as 

G C sd(9) = / Vgw e (x)dx 1 
J[o,i] d 

G K ld(9) = / {S7gw e (x) \og[we{x)\ + V d wg(x)} dx . 

J[0,l] d 

For any given x, the quantity VgWe(x) is the pathwise derivative of the function 9 \— > wg(x). 
As a practical matter, we usually think of each x as a realization of of the output of an ideal 
random generator. Each wg(x) is then the output of the simulation algorithm at parameter 
9 for the random number x. Each VeWg(x) is the derivative of the simulation output with 
respect to 9 with the random numbers held fixed. These two expressions, which of course 
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coincide with (2.9) and (2.10), lead to the following estimators, 

M 

G CS d(0) = M- l Y,^ew e {e l ) , 

i=i 

M 

G K ld{0) = M~ l {VoM*) log[w (ei)] + V e w e (e t )} , 

i=l 

where each element of the sequence {e^}^£ 1 is a vector on [0, l] d of independent uniform 
random variables. It is worthwhile to note that if the number Mg = M is kept fixed during 
the iterations and the uniforms {ti}f£i are drawn once and for all (i.e. the same uniforms 
are used at the different iterations), then the iterative algorithm outlined above solves the 
following problem: 

9 ~ CV 2 , (2.13) 

0" e (We,)}^) . (2.14) 

From a theoretical standpoint, this optimisation problem is very similar to M-estimation, 
and convergence results for M-estimators can thus be used under rather standard technical 
assumptions; see for example Van der Vaart (1998). This is the main advantage of fixing the 
sample {e^}^. We use this implementation in the simulations. 

Under appropriate conditions, the sequence of estimators #| CSD or 6^ KLD of these criteria 
converge, as the number of iterations tends to infinity, to #csd or ^kld which minimise 
the criteria 9 i— > d x 2(p\\q s ) and 9 h- > cIkl(p\ lie), respectively; these theoretical issues are 
considered in a companion paper. 

The second class of approaches considered in this paper is used for minimising the KLD 
(2.14) and is inspired by the cross-entropy method. This algorithm approximates the min- 
imum #kld °f (2-14) by a sequence of pairs of steps, where each step of each pair ad- 
dresses a simpler optimisation problem. Compared to the previous method, this algorithm is 
derivative-free and does not require to select a step size. It is in general simpler to implement 
and avoid most of the common pitfalls of stochastic approximation. Denote by #o £ @ an 
initial value. We define recursively the sequence {6^}^> as follows. In a first step, we draw 
a sample and evaluate the function 

9 i * Qi{0, 0i) 4 W ei (&) logg„(#) • (2.15) 

i=l 

In a second step, we choose $e + i to be the (or any, if there are several) value of 9 G that 
maximises Qe(0, Og). As above, the number of particles Mg is increased during the successive 
iterations. This procedure ressembles closely the Monte Carlo EM (Wei and Tanner, 1991) 
for maximum likelihood in incomplete data models. The advantage of this approach is that 
the solution of the maximisation problem 9i + \ = argmax^gQ G Qe(9,9g) is often on closed 
form. In particular, this happens if the distribution qg belongs to an exponential family 
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(EF) or is a mixture of distributions of NEF; see Cappe et al. (2008) for a discussion. The 
convergence of this algorithm can be established along the same lines as the convergence of 
the MCEM algorithm; see Fort and Moulines (2003). As the number of iterations I increases, 
the sequence of estimators Qi may be shown to converge to #kld- These theoretical results 
are established in a companion paper. 

2.2. Sequential Monte Carlo Methods. In the sequential context, where the problem 
consists in simulating from a sequence {p k } of probability density function, the situation is 
more difficult. Let be denote the state space of distribution p k and note that this space 
may vary with k, e.g. in terms of increasing dimensionality. In many applications, these 
densities are related to each other by a (possibly random) mapping, i.e. pk = ^k-i(Pk-i)- 
In the sequel we focus on the case where there exists a non-negative function l k _i : (£, £) i— > 
lk-i(£, such that 

J Pk-AK) J d£d£ 

As an example, consider the following generic nonlinear dynamic system described in state 
space form: 



State (system) model 



Transition Density 

X k = a(AVi, U k ) <- qJxT^Q , (2.17) 
• Observation (measurement) model 

Observation Density 

Y k = b(X k ,V k )» gUCjk) • (2.18) 

By these equations we mean that each hidden state X k and data Y k are assumed to be 
generated by nonlinear functions a(-) and &(•), respectively, of the state and observation 
noises U k and V k . The state and the observation noises {U k } k > and {V k } k > are assumed 
to be mutually independent sequences of i.i.d. random variables. The precise form of the 
functions and the assumed probability distributions of the state and observation noises U k 
and V k imply, via a change of variables, the transition probability density function q(x k -i, x k ) 
and the observation probability density function g(x k , y k ), the latter being referred to as the 
likelihood of the observation. With these definitions, the process {X k } k > is Markovian, 
i.e. the conditional probability density of X k given the past states X , k _i = (X , . . . ,X k _i) 
depends exclusively on Xk-i- This distribution is described by the density q(x k -i,x k ). 
In addition, the conditional probability density of Y k given the states X 0:k and the past 
observations Yo:fc-i depends exclusively on X k , and this distribution is captured by the 
likelihood g(x k ,y k ). We assume further that the initial state Xq is distributed according 
to a density function 7To(xo). Such nonlinear dynamic systems arise frequently in many 
areas of science and engineering such as target tracking, computer vision, terrain referenced 
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navigation, finance, pollution monitoring, communications, audio engineering, to list only a 
few. 

Statistical inference for the general nonlinear dynamic system above involves computing 
the posterior distribution of a collection of state variables X s:s i = (X s , . . . ,X S >) conditioned 
on a batch Y 0:k = {Yo, ■ ■ ■ , Yj.) of observations. We denote this posterior distribution by 
(frs-.s'lkiXs-.s'lYoik)- Specific problems include filtering, corresponding to s — s' — k, fixed lag 
smoothing, where s = s' = k — L, and fixed interval smoothing, with s = and s' = k. 
Despite the apparent simplicity of the above problem, the posterior distributions can be 
computed in closed form only in very specific cases, principally, the linear Gaussian model 
(where the functions a(-) and b(-) are linear and the state and observation noises {Uk}k>o 
and {Vk}k>o are Gaussian) and the discrete hidden Markov model (where X k takes its values 
in a finite alphabet). In the vast majority of cases, nonlinearity or non-Gaussianity render 
analytic solutions intractable — see Anderson and Moore (1979); Kailath et al. (2000); Ristic 
et al. (2004); Cappe et al. (2005). 

Starting with the initial, or prior, density function 7r (x ), and observations Y Q . k = y 0:k , 
the posterior density <fik\k{xk\yo:k) can be obtained using the following prediction- correction 
recursion (Ho and Lee, 1964): 

• Prediction 

<f>k\k-l(%k\yO:k-l) = <f>k-l\k-l(Xk-l\yO:k-l)q(Xk-l,Xk) , (2.19) 

• Correction 

i i | v g(xk,yk)(j>k\k-i(xk\yo:k-i) fn on , 

<j>k\k{Xk\yO:k) = g 7—, v , (2.20) 

h\k-i{yk\yo:k-i) 

where £k\k-i is the predictive distribution of Y k given the past observations io:fe-i- For a 
fixed data realisation, this term is a normalising constant (independent of the state) and is 
thus not necessary to compute in standard implementations of SMC methods. 
By setting p k = <f> k \ k , p k _ x = <f> k -i\k-i, and 

l k _i(x,x') = g{x k ,y k )q{x k _i,x k ) , 

we conclude that the sequence {4> k \ k } k >i of filtering densities can be generated according to 
(2.16). 

The case of fixed interval smoothing works entirely analogously: indeed, since 

4>0:k\k-l{%0:k\y0:k-l) = 0O:fe-l|fe-l (^0:fc-l I yO:k-l)Q{ X k-l i x k) 

and 

* / I \ g{xk,yk)4>k\k-i{xo:k\yo:k-i) 

(PO:k\k{Xk\yO:k) = - ( i v , 

the flow {4>Q:k\k\k>i of smoothing distributions can be generated according to (2.16) by letting 
Pk = 4>o-.k\k, Pk-i = 0o:fc-i|fe-i, and replacing / fe _i(x 0: fe-i, x' 0:k ) dx' 0:k by g(x' k , y k ) q{x k -i, x' k ) dx' k 
<5 a ; . fc _ 1 (dxQ. fc _ 1 ), where S a denotes the Dirac mass located in a. Note that this replacement 
is done formally since the unnormalised transition kernel in question lacks a density in the 
smoothing mode; this is due to the fact that the Dirac measure is singular with respect 
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to the Lebesgue measure. This is however handled by the measure theoretic approach in 
Section 4, implying that all theoretical results presented in the following will comprise also 
fixed interval smoothing. 

We now adapt the procedures considered in the previous section to the sampling of den- 
sities generated according to (2.16). Here we focus on a single time-step, and drop from 
the notation the dependence on k which is irrelevant at this stage. Moreover, set pk = fi, 
Pk-i = v, h = I, and assume that we have at hand a weighted sample {(^7v,i, ^N,i)}iLi 
targeting u, i.e., for any i^-integrable function /, fi^ 1 J2i=i k , JV,i/(£iV,i) approximates the cor- 
responding integral J /(OKO d£- A natural strategy for sampling from \x is to replace v in 
(2.16) by its particle approximation, yielding 



MMSJ - 2^ 777Z ~T7~ 

l= l Ei=l^N,jJ l{£N,j,0<% 



as an approximation of /i, and simulate new particles from this distribution; however, in 
many applications direct simulation from fi^ is infeasible without the application of compu- 
tationally expensive auxiliary accept-reject techniques introduced by Hiirzeler and Kiinsch 
(1998) and thoroughly analysed by Kiinsch (2005). This difficulty can be overcome by sim- 
ulating new particles {£,N,i}i=[ from the instrumental mixture distribution with density 

N i 

tjv(0 = 2^ ^iv ; — r (6v,i,0 > 

»=i Lj=i%w,j 

where {"0iV,i}i=i are the so-called adjustment multiplier weights and r is a Markovian transi- 
tion density function, i.e., ?"(£,£) is a nonnegative function and, for any £ £ S, J r(£,£) d£ = 
1. If one can guess, based on the new observation, which particles are most likely to con- 
tribute significantly to the posterior, the resampling stage may be anticipated by increasing 
(or decreasing) the importance weights. This is the purpose of using the multiplier weights 

ipN,i- We associate these particles with importance weights {f^N{C,N,i) /^N(^N,i)}i=[ • In this 
setting, a new particle position is simulated from the transition proposal density r(£/v,i> •) with 
probability proportional to uj N ^ip N ^. Haplessly, the importance weight /^jv(£jv,«) /^n(Cn,%) is 
expensive to evaluate since this involves summing over N terms. 

We thus introduce, as suggested by Pitt and Shephard (1999), an auxiliary variable cor- 
responding to the selected particle, and target instead the probability density 



J"Z(6v,i,£)df 



(2.21) 



■*3 

on the product space {1, . . . , N} x S. Since /x N is the marginal distribution of /i^ x with re 



spect to the particle index i, we may sample from /xjv by simulating instead a set {(ijv,i> Cw,*)}^ 
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of indices and particle positions from the instrumental distribution 



*Tii, = v-\AT rfo,*, (2.22) 

and assigning each draw (lN,i,£,N,i) the weight 

A ftfriltyii ^N,i) ,_i K£N,I N ,i, 6v,i) /o oo\ 

^ = ^*77 — = ^ n ^1Tp T~\ • ( } 

Hereafter, we discard the indices and let {(6v,i, ^A^0li=i approximate the target density /i. 
Note that setting, for all i G {1, . . . , N}, i/jnj = 1 yields the standard bootstrap particle 
filter presented by Gordon et al. (1993). In the sequel, we assume that each adjustment 
multiplier weight "0jv,j is a function of the particle position -0jv,i = ^(Cjv,?), i £ {!?••• j^}> 
and define 

(2-24) 

so that fi^ x (i,C,)/7if^ x (i,C,) is proportional to 3>(£jv,i, £). We wm re f er to the function \1/ as 
the adjustment multiplier function. 

2.3. Risk minimisation for sequential adaptive importance sampling and resam- 
pling. We may expect that the efficiency of the algorithm described above depends highly 
on the choice of adjustment multiplier weights and proposal kernel. 

In the context of state space models, Pitt and Shephard (1999) suggested to use an approx- 
imation, defined as the value of the likelihood evaluated at the mean of the prior transition, 
i.e. ipN,i — g (J x'q(£,N,i, x') dx', yk) , where yk is the current observation, of the predictive like- 
lihood as adjustment multiplier weights. Although this choice of the weight outperforms the 
conventional bootstrap filter in many applications, as pointed out in Andrieu et al. (2003), 
this approximation of the predictive likelihood could be very poor and lead to performance 
even worse than that of the conventional approach if the dynamic model q(xk-i, Xk) is quite 
scattered and the likelihood g(xk,yk) varies significantly over the prior q(xk-i,Xk). 

The optimisation of the adjustment multiplier weight was also studied by Douc et al. 
(2008) (see also Olsson et al. (2007)) who identified adjustment multiplier weights for which 
the increase of asymptotic variance at a single iteration of the algorithm is minimal. Note 
however that this optimisation is done using a function- specific criterion, whereas we advo- 
cate here the use of function-free criteria. 

In our risk minimisation setting, this means that both the adjustment weights and the 
proposal kernels need to be adapted. As we will see below, these two problems are in general 
intertwined; however, in the following it will be clear that the two criteria CSD and KLD 
behave differently at this point. Because the criteria are rather involved, it is interesting 
to study their behaviour as the number of particles N grows to infinity. This is done in 
Theorem 4.1, which shows that the CSD c/ x 2(/_i^ lx ||7r^ lx ) and KLD dKL^^IKjv*) converges 
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to d x 2(n*\\n^) and g?kl (/•£*! I 71 "^), respectively, where 



//KO'ce.o^^ (2 25) 

The expressions (2.25) of the limiting distributions then allow for deriving the adjustment 
multiplier weight function \l/ and the proposal density / minimising the corresponding dis- 
crepancy measures. In absence of constraints (when \I/ and I can be chosen arbitrarily), the 
optimal solution for both the CSD and the KLD consists in setting ^ = \&* and r = r*, 
where 

= / / ^f^Odi, (2.26) 

r*(£,£l^(£, £)/**(£)■ (2.27) 

This choice coincides with the so-called optimal sampling strategy proposed by Hiirzeler and 
Kiinsch (1998) and developed further by Kiinsch (2005), which turns out to be optimal (in 
absence of constraints) in our risk-minimisation setting. 

Remark 2.1. The limiting distributions fi* and tt^ have nice interpretations within the 
framework of state space models (see the previous section). In this setting, the limiting dis- 
tribution fi* at time k is the joint distribution 4>k-.k+i\k+i of the filtered couple Xk-.k+ii that is, 
the distribution of Xk-.k+i conditionally on the observation record Yo-.k+i,' this can be seen as 
the asymptotic target distribution of our particle model. Moreover, the limiting distribution 
it* at time k is only slightly more intricate: Its first marginal corresponds to the filtering 
distribution at time k reweighted by the adjustment function which is typically used for 
incorporating information from the new observation Yk+\. The second marginal of n* is then 
obtained by propagating this weighted filtering distribution through the Markovian dynamics 
of the proposal kernel R; thus, 7r£ describes completely the asymptotic instrumental distri- 
bution of the APF, and the two quantities c?kl(a** 1 1 71 "*) an d 4k 2 (/■**! I 71 "*) reflect the asymptotic 
discrepancy between the true model and the particle model at the given time step. 

In presence of constraints on the choice of \& and r, the optimisation of the adjustment 
weight function and the proposal kernel density is intertwined. By the so-called chain rule 
for entropy (see Cover and Thomas, 1991, Theorem 2.2.1), we have 

A I -II '\ f " K) .1,11 f £^g)AgA it ft "(£) ue (si MjjA VAl 

where v(f) — J f(£)/(0 d£- Hence, if the optimal adjustment function can be chosen freely, 
then, whatever the choice of the proposal kernel is, the best choice is still ^kl r = ^ ne ^ es ^ 
that we can do is to choose r such that the two marginal distributions £ i— >• J" //*(£, £) d£ 
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and £ i— > j vr*(£,£)d£ are identical. If the choices of the weight adjustment function and 
the proposal kernels are constrained (if, e.g., the weight should be chosen in a pre-specified 
family of functions or the proposal kernel belongs to a parametric family), nevertheless, 
the optimisation of ^ and r decouple asymptotically. The optimisation for the CSD does 
not lead to such a nice decoupling of the adjustment function and the proposal transition; 
nevertheless, an explicit expression for the adjustment multiplier weights can still be found 
in this case: 

^W/tW/Si^- (2 ' 28) 



Compared to (2.26), the optimal adjustment function for the CSD is the L 2 (rather than 
the L 1 ) norm of £ ^ / 2 (£, £)/r 2 (£, £). Since /(£,£) = #*(f)r*(f,f) (see definitions (2.26) and 
(2.27)), we obtain, not surprisingly, if we set r = r*, r (£) = 

Using this risk minimisation formulation, it is possible to select the adjustment weight 
function as well as the proposal kernel by minimising either the CSD or the KLD criteria. 
Of course, compared to the sophisticated adaptation strategies considered for adaptive im- 
portance sampling, we focus on elementary schemes, the computational burden being quickly 
a limiting factor in the SMC context. 

To simplify the presentation, we consider in the sequel the adaptation of the proposal 
kernel; as shown above, it is of course possible and worthwhile to jointly optimise the adjust- 
ment weight and the proposal kernel, but for clarity we prefer to postpone the presentation 
of such a technique to a future work. The optimisation of the adjustment weight function 
is in general rather complex: indeed, as mentioned above, the computation of the optimal 
adjustment weight function requires the computing of an integral. This integral can be 
evaluated in closed form only for a rather limited number of models; otherwise, a numer- 
ical approximation (based on cubature formulae, Monte Carlo etc) is required, which may 
therefore incur a quite substantial computational cost. If proper simplifications and approx- 
imations are not found (which are, most often, model specific) the gains in efficiency are 
not necessarily worth the extra cost. In state space (tracking) problems simple and efficient 
approximations, based either on the EKF or the UKF (see for example Andrieu et al. (2003) 
or Shen et al. (2004)), have been proposed for several models, but the validity of this sort of 
approximations cannot necessarily be extended to more general models. 

In the light of the discussion above, a natural strategy for adaptive design of 7r|^ x is 
to minimise the empirical estimate 8 (or CV 2 ) of the KLD (or CSD) over all proposal 
kernels belonging to some parametric family {rg}e^e. This can be done using straightforward 
adaptations of the two methods described in Section 2.1. We postpone a more precise 
description of the algorithms and implementation issues to after the next section, where 
more rigorous measure-theoretic notation is introduced and the main theoretical results are 
stated. 
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3. Notations and definitions 

To state precisely the results, we will now use measure-theoretic notation. In the following 
we assume that all random variables are defined on a common probability space (Q, J 7 , P) 
and let, for any general state space (3,23(3)), V(S) and B(S) be the sets of probability 
measures on (3,23(3)) and measurable functions from 3 to M, respectively. 

A kernel K from (3, 23(3)) to some other state space (3, 23(3)) is called finite if K(£, 3) < 
oo for all £ G 3 and Markovian if K(£, 3) = 1 for all £ G 3. Moreover, K induces two 
operators, one transforming a function / G B(3 x 3) satisfying J" a |/(£, £)| K(£, d£) < oo 
into another function 

e ~ k{& /) 4 J_ /(e, e) df) 

in B(3); the other transforms a measure z/ G P(3) into another measure 

A»vK{A)± J^K{d,A)v{6£) (3.1) 

in 7^(3). Furthermore, for any probability measure \i G V(S) and function / G B(S) 
satisfying / a /i(d£) < oo, we write //(/) = / B /(£) /i(d£). 

The ottter product of the measure 7 and the kernel T, denoted by 7 <S> T, is defined as the 
measure on the product space 3x3, equipped with the product cr-algebra 23(3) <g> 23(3), 
satisfying 

7®T(A)^ jj^ ^ T (d£)T(£,d£)]U(£,£') (3.2) 

for any A G 23(3) (g) 23(3). For a non-negative function / G B(S), we define the modulated 
measure ^[f] on (3,23(3)) by 

7 [/](^)4 7 (/l A ), (3.3) 

for any A G 23(3). 

In the sequel, we will use the following definition. A set C of real-valued functions on 3 
is said to be proper if the following conditions hold: (i) C is a linear space; (ii) if g G C and 
/ is measurable with |/| < \g\, then |/| G C; (iii) for all c G M, the constant function f = c 
belongs to C. 

Definition 3.1. A weighted sample {(£jv,tj^JV,i)}t=i on " ^ s sa ^ ^° ^ e consistent for the 
probability measure v G P(3) and the set C if, for any f G C, as N — >• 00, 

M N 

^N l ^2 UJ N,if(^N, l ) ► V(f) , 

i=X 

_i IP 

r2 w max to n i — >0, 

\<i<M N 



where Vt N = Y.t=i w N,i- 



18 



J. CORNEBISE, E. MOULINES, AND J. OLSSON 



Alternatively, we will sometimes say that the weighted sample in Definition 3.1 targets the 
measure v. 

Thus, suppose that we are given a weighted sample {(£#,»> ^JV,i)}i=i targeting v € V(S). 
We wish to transform this sample into a new weighted particle sample approximating the 
probability measure 

^L(H) /„£({', SMd£') 

on some other state space (H,i3(H)). Here L is a finite transition kernel from (S,B(S)) to 
(E,£>(E)). As suggested by Pitt and Shephard (1999), an auxiliary variable corresponding 
to the selected stratum, and target the measure 



(3.5) 

on the product space {1,...,Mjv} x S. Since /i^ is the marginal distribution of /i^ x 
with respect to the particle position, we may sample from /x^r by simulating instead a set 

{(Inj, 6v,i)}j=i °f indices and particle positions from the instrumental distribution 

ttJT ({<} x A) 4 ff^*' A) (3.6) 

and assigning each draw (lN,i,£,N,i) the weight 

w ^-^dfl(^, 4l -) (ui) 

being proportional to dp,^ / ^^(In^, ^n,i) — the formal difference with Equation (2.23) lies 
only in the use of Radon- Nykodym derivatives of the two kernels rather than densities with 

respect to Lebesgue measure. Hereafter, we discard the indices and take {(^N,i,^N,i)}i=[ as 
an approximation of fi. The algorithm is summarised below. 



Algorithm 3.1 Nonadaptive APF 



Require: {{£ Nii , ^N,i)}Hi targets v. 

Draw ~ M(M N , {u N ^ N J E?2 ^W^S), 

simulate {6v,}& ~ <S)f=x R(Ui N>i , ■), 
set, for all i £ {1, ... , Mn}, 

4: take {(^J\r,i, ^iY,j)}i=i as an approximation of /i. 
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4. Theoretical results 

Consider the following assumptions. 
(Al) The initial sample {(6v,i, w iv,i)}i=T is consistent for (v,C). 

(A2) There exists a function \& : S — > R + stzc/i £/iai ^jv,i = ^ (Cjv,i)/ moreover, \P G 
C n L 1 (S, i/) and L(-, S) G C. 

Under these assumptions we define for (£, £) G 3 x S the weight function 

mi) = * _1 (0^|4y(0, (4-1) 

so that for every index z, tDjv.i = ^(Ci Ni jCN,i)- The following result describes how the 
consistency property is passed through one step of the APF algorithm. A somewhat less 
general version of this result was also proved in Douc et al. (2008) (Theorem 3.1). 



Proposition 4.1. Assume (Al, A2). Then the weighted sample {(6v,i, ^A^)}f=i 
tent for (z/,C), where C = {/ G L 1 (S, /x), L(-, |/|) G C}. 



consis- 



The result above is a direct consequence of Lemma A. 2 and the fact that the set C is 
proper. 

Let fi and v be two probability measures in V(A) such that \x is absolutely continuous 
with respect to v. We then recall that the KLD and the CSD are, respectively, given by 

dKxG"IH — / log[d/i/dz/(A)] /x(dA) , 
d x 2(^||v) 4 / [d/i/di/(A) - \fv{<S\) . 

J A 

Define the two probability measures on the product space (S x S, B(S) ® B(S)): 

where A G £>(H) <g> £>(H) and the outer product <E> of a measure and a kernel is defined in 
(3.2). 

Theorem 4.1. Assume (Al, A2). Then the following holds as N — > oo. 
(%) //L(-,|log$|) G C fi L 1 (S, v), then 

d KL (iiT\UT) dKL (fi*\\ O , (4.4) 

(ii) If L(-, $) G C, i/ien 

^(^rikD -^dx»(A*1K) , (4.5) 
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Additionally, £ and CV 2 , denned in (2.8) and (2.4) respectively, converge to the same 
limits. 

Theorem 4.2. Assume (Al, A2). Then the following holds as N — > oo. 

(i) IfL(; | log $|) e C fl L 1 (S, v), then 

£({uj N ,i}f=i) -^d Kh (fi*\\nl) . (4.6) 

(ii) If L(-, $) G C, i/ien 

CV 2 ({^}&)-^d x2 (^||vr*) . (4.7) 

Next, it is shown that the adjustment weight function can be chosen so as to minimize 
the RHS of (4.4) and (4.5). 

Proposition 4.2. Assume (Al, A2). Then the following holds, 
(i) IfL(-, |log$|) e C n L X (S, v), then 

arg min^ d KL (//|| vr^) = ^kl,/? ^/iere ^kl,ji(0 - 3 ) • 
If L(-, $) G C, tfien 



arg min^ d x 2 (//|| tt^) = ^* 2;i? w/iere ^* 2)K (0 = J J_ -j (f) L(f , d£) . 

It is worthwhile to notice that the optimal adjustment weights for the KLD do not depend 
on the proposal kernel R. The minimal value c^l (/•**! I ^%* r ) of the limiting KLD is the 
conditional relative entropy between \x* and ir*. 

In both cases, letting R(-,A) = L(-,A)/L(-,a) yields, as we may expect, the optimal 
adjustment multiplier weight function ^klr(') = ^*zr(') = i('iS), resulting in uniform 
importance weights uiN,i = 1- 

It is possible to relate the asymptotic CSD (4.5) between /Xjy x and 7ij^ x to the asymptotic 

variance of the estimator fl^ 1 YliJi &N,if(iN,i) of an expectation //(/) for a given integrable 
target function /. More specifically, suppose that M N /M N —> £ e [0, oo] as N — > oo and 
that the initial sample {(6v,i> ^JV,i)}i=i satisfies, for all / belonging to a given class A of 
functions, the central limit theorem 

M N 

a N Q- N l Y,^NAf{iN,)-^f)] ^A/10,a 2 (/)] , (4.8) 

where the sequence {cln}n is such that a N M N — > /3 G [0, oo) as A 7 " — > oo and a : A — > M + 
is a functional. Then the sample {(£n,u &N,i)}iJi produced in Algorithm 3.1 is, as showed 
in (Douc et al., 2008, Theorem 3.2), asymptotically normal for a class of functions A in the 
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sense that, for all / G A, 

M N 

1=1 

where 

a 2 m(f) = o*{L[; f - Kf)]}/["L(; S)] 2 + WMVRi; $ 2 [/ - v(f)] 2 })v(*)/lvL(S)f 
and, recalling the definition (3.3) of a modulated measure, 

v{^R{.^[f - ^{f)f})^)/[vL{k)f 

= A\f\)dA^\\f\V^{\f\)\W} 

-2^uiKfT)dA^vT\/^\fT)\\^} 

+ 2M/)M/ V2 ) ^K[/- /2 ]/^(/- /2 )lk*} 

+V(/)WIM+/^(I/|)-V(/)- (4.9) 

Here /+ = max(/, 0) and /_ = max(— /, 0) denote the positive and negative parts of /, 
respectively, and refers to the expectation of the extended function |/| : (£,£) G 

~ 1 /2 

S x S i— > £ ^ + under /x* (and similarly for //*(/± )). From (4.9) we deduce that 

decreasing c? x 2 (yLi*| ]7r*) will imply a decrease of asymptotic variance if the discrepancy between 
fi* and modulated measure /•**[!/!] //•**( |/|) is not too large, that is, we deal with a target 
function / with a regular behavour in the support of /x*(3 x •). 

5. Adaptive importance sampling 

5.1. APF adaptation by minimisation of estimated KLD and CSD over a para- 
metric family. Assume that there exists a random noise variable e, having distribution A 
on some measurable space (A, B(A)), and a family {Fg}e^e of mappings from S x A to S 
such that we are able to simulate £ ~ Re(£, ■), for £ G H, by simulating e ~ A and letting 
£ = Fg(£,e). We denote by the importance weight function associated with R g , see (4.1) 
and set $ o F g {£, e) 4 $ e (£, F g {£, e)). 

Assume that (Al) holds and suppose that we have simulated, as in the first step of 

Algorithm 3.1, indices {ijv,i}i=T an d noise variables {ejv,i} i= i ~ \® Mn . Now, keeping these 
indices and noise variables fixed, we can form an idea of how the KLD varies with 9 via the 

mapping 6 i— > £{{§q o F g (^N,i Ni , tN,i)}i=i)- Similarly, the CSD can be studied by using CV 2 
instead of £. This suggests an algorithm in which the particles are reproposed using Re* , 

with 9* N = arg min 06e £{{<& e ° Fe^N,i N>v e Nti )}?li)- 

This procedure is summarised in Algorithm 5.1, and its modification for minimisation of 
the empirical CSD is straightforward. 
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Algorithm 5.1 Adaptive APF 



Require: (Al) 



1: Draw {I N ,i}Zi ~ M{M N ,{uo N ^ N JY.f=i^NAN,t}f=i) 

2: simulate {e^lfe ~ , 

3: 0£ <- arg min eee £({$ e o F e (^ j7 ^, )}£?), 




Remark 5.1. y4. slight modification of Algorithm 5.1, lowering the added computational 
burden, is to apply the adaptation mechanism only when the estimated KLD (or CSD) is 
above a chosen threshold. 

Remark 5.2. It is possible to establish a law of large numbers as well as a central limit 
theorem for the algorithm above, similarly to what has been done for the nonadaptive auxiliary 
particle filter in Douc et al. (2008) and Olsson et al. (2007). 

More specifically, suppose again that (4.8) holds for similar (A, f3, cr(-)) and that Mn/Mn — > 
£ G [0, oo] as N — ► oo. Then the sample {(fjv,j, ^jv,i) produced in Algorithm 5.1 is asymp- 
totically normal for a class of functions A in the sense that, for all f G A , 



**.(/) = /3rM*iM-, K\f - M/)] 2 }M*)/K(s)] 2 + ° 2 (H; if - ^(/)]})/K(s)] 2 



and 0* minimises the asymptotic KLD. The complete proof of this result is however omitted 
for brevity. 

5.2. APF adaptation by cross-entropy (CE) methods. Here we construct an algo- 
rithm which selects a proposal kernel from a parametric family in a way that minimises the 
KLD between the instrumental mixture distribution and the target mixture /i^ x (defined 
in (3.5)). Thus, recall that we are given an initial sample {(6v,i, ^JV,i)}i=i j we then use IS 
to approximate the target auxiliary distribution by sampling from the instrumental 
auxiliary distribution 



where 



x A ) = 



Re(£,N,i, A) , 



(5.1) 
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which is a straightforward modification of (3.6) where R is replaced by Rg, that is, a Markov- 
ian kernel from (H, £>(H)) to (H, £>(H)) belonging to the parametric family {Re{i, ■) : £ £ S, 
9 £ 0}. 

We aim at finding the parameter 8* which realises the minimum of 6 1 i — o?kl(/^/v 1x | | 
over the parameter space G, where 



aux \ 
Nfi) 




dKL^TW^Te) = > J log /^MO ■ (5-2) 



Since the expectation in (5.2) is intractable in most cases, the key idea is to approximate 
it iteratively using IS from more and more accurate approximations — this idea has been 
successfully used in CE methods; see e.g. Rubinstein and Kroese (2004). At iteration £, 
denote by 8 e N £ the current fit of the parameter. Each iteration of the algorithm is split 
into two steps: In the first step we sample, following Algorithm 3.1 with = M l N and 

R = -R^, M l N particles {(i^, CN,i)}iJi from vr^ 1 ^ . Note that the adjustment multiplier 
weights are kept constant during the iterations, a limitation which is however not necessary. 
The second step consists in computing the exact solution 



n ~t 



. aux 



e'r A arg mm £ log |L(4<, I (5.3) 

to the problem of minimising the Monte Carlo approximation of (5.2). In the case where the 
kernels L and Rg have densities, denoted by I and rg, respectively, with respect to a common 
reference measure on (E,£>(H)), the minimisation program (5.3) is equivalent to 

e 1 =argmax^^l gr,(e 4i ,^) . (5.4) 

This algorithm is helpful only in situations where the minimisation problem (5.3) is suffi- 
ciently simple for allowing for closed-form minimisation; this happens, for example, if the 
objective function is a convex combination of concave functions, whose minimum either ad- 
mits a (simple) closed-form expression or is straightforward to minimise numerically. As 
mentioned in Section 2.1, this is generally the case when the function rg(£, •) belongs to an 
exponential family for any £ £ H. 

Since this optimisation problem closely ressembles the Monte Carlo EM algorithm, all the 
implementation details of these algorithms can be readily transposed to that context; see 
for example Levine and Casella (2001), Eickhoff et al. (2004), and Levine and Fan (2004). 
Because we use very simple models, convergence occurs, as seen in Section 6, within only 
few iterations. When choosing the successive particle sample sizes {M^}f =l , we are facing 
a trade-off between precision of the approximation (5.3) of (5.2) and computational cost. 
Numerical evidence typically shows that these sizes may, as high precision is less crucial 
here than when generating the final population from vr^ L , be relatively small compared to 



N 
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the final size Mjv- Besides, it is possible (and even theoretically recommended) to increase 
the number of particles with the iteration index, since, heuristically, high accuracy is less 
required at the first steps. In the current implementation in Section 6, we will show that 
fixing a priori the total number of iterations and using the same number = M^/L of 
particles at each iteration may provide satisfactory results in a first run. 

Algorithm 5.2 CE-based adaptive APF 

Require: c^)}^ targets v. 
1: Choose an arbitrary 9%, 

2: for £ — 0, . . . , L — ldo > More intricate criteria are sensible 

3: draw 

M N 

71=1 

4: simulate {g^jg ~ <g>£f ^(^,, ■). 
5: update 

6: compute, with available closed-form, 
7: end for 

8: run Algorithm 3.1 with R = R b l . 



6. Application to state space models 

For an illustration of our findings we return to the framework of state space models in 
Section 2.2 and apply the CE-adaptation-based particle method to filtering in nonlinear 
state space models of type 

X k+ i = m(X k ) + a w (X k )W k+1 , k > , 

Y k = X k + a v V k , k > , ( • } 

where the parameter a v and the M-valued functions (m, er^) are known, and {W^}^ and 
{Vfc}^ are mutually independent sequences of independent standard normal-distributed 
variables. In this setting, we wish to approximate the filter distributions {(p k \ k } k >o, defined 
in Section 2.2 as the posterior distributions of X k given Y 0:k (recall that Y 0]k — (Y , . . . , Y k )), 
which in general lack closed-form expressions. For models of this type, the optimal weight 
and density defined in (2.26) and (2.27), respectively, can be expressed in closed-form: 

%(x) = M{Y k+1 - m(x), y/ol{x) + al) , (6.2) 
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where J\f(x; fj,, a) = exp(— (x — fi) 2 / (2a 2 ))/ V 2ira 2 and 

r ti x , x ') = N(x'; t(x, Y k+1 ),r}(x)) , (6.3) 

with 

T{X,Y k+1 ) = - 



rj (x) 



We may also compute the chi-square optimal adjustment multiplier weight function q 
when the prior kernel is used as proposal: at time k, 



We recall from Proposition 4.2 that the optimal adjustment weight function for the KLD is 
given by V* KLtQ {x) = V%(x). 

In these intentionally chosen simple example we will consider, at each timestep k, adaption 
over the family 

{Rf>(x, ■) ± N(r(x,Y k+1 ),dr)(x)) : x eR,d > o} (6.5) 

of proposal kernels. In addition, we keep the adjustment weights constant, that is ^(x) — 1. 

The mode of each proposal kernel is centered at the mode of the optimal kernel, and the 
variance is proportional to the inverse of the Hessian of the optimal kernel at the mode. 
Let rg(x,x') = J\f(x';r(x,Y k+1 ),9r](x)) denote the density of Rg(x,-) with respect to the 
Lebesgue measure. In this setting, at every timestep k, a closed-form expression of the KLD 
between the target and proposal distributions is available: 



M N I * 



j I ,,aux| |„^aux\ \ v 



log 




where we set ^* N ^ = ^*(£n,i) and Q N = YlfLi ^N,i- 

As we are scaling the optimal standard deviation, it is obvious that 



(6.6) 



9* N ± argmind K L(/4 UX ira = 1 > (6-7) 

which may also be inferred by straightforward derivation of (6.6) with respect to 9. This 
provides us with a reference to which the parameter values found by our algorithm can be 
compared. Note that the instrumental distribution 7r^*^ differs from the target distribution 
/x^ x by the adjustment weights used: recall that every instrumental distribution in the family 
considered has uniform adjustment weights, ^(x) = 1, whereas the overall optimal proposal 
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has, since it is equal to the target distribution /i^ x , the optimal weights defined in (6.2). 
This entails that 

i=i Z^j=i ^njVnj V Z^j=i ^njVnj J 

which is zero if all the optimal weights are equal. 

The implementation of Algorithm 5.2 is straightforward as the optimisation program (5.4) 
has the following closed-form solution: 

(K e% 2 .i/2 

I i=l ^nV et V ' N ' J ) 

where r/v,i — 7~(6v,i> Yk+i) an d ^Iri — V 2 (^N,i)- This is a typical case where the family of 
proposal kernels allows for efficient minimisation. Richer families sharing this property may 
also be used, but here we are voluntarily willing to keep this toy example as simple as 
possible. 

We will study the following special case of the model (6.1): 

m(x) = 0, a w (x) = \//9o + Pi% 2 ■ 

This is the classical Gaussian autoregressive conditional heteroscedasticity (ARCH) model 
observed in noise (see Bollerslev et al. (1994)). In this case an experiment was conducted 
where we compared: 

(i) a plain nonadaptive particle filter for which \& = 1, that is, the bootstrap particle 
filter of Gordon et al. (1993), 

(ii) an auxiliary filter based on the prior kernel and chi-square optimal weights \l/* 2 q, 

(iii) adaptive bootstrap filters with uniform adjustment multiplier weights using numerical 
minimisation of the empirical CSD and 

(iv) the empirical KLD (Algorithm 5.1), 

(v) an adaptive bootstrap filter using direct minimisation of c^kiX/^ 1 *! |vr^ l g), see (6.7), 

(vi) a CE-based adaptive bootstrap filter, and as a reference, 

(vi) an optimal auxiliary particle filter, i.e. a filter using the optimal weight and proposal 
kernel defined in (6.2) and (6.3), respectively. 

This experiment was conducted for the parameter set (/3o, j3i, a 2 ) — (1,0.99,10), yielding 
(since /3\ < 1) a geometrically ergodic ARCH(l) model (see Chen and Chen, 2000, Theo- 
rem 1); the noise variance a% is equal to 1/10 of the stationary variance, which here is equal 
to of = (3q/ (1 — Pi), of the state process. 

In order to design a challenging test of the adaptation procedures we set, after having run 
a hundred burn-in iterations to reach stationarity of the hidden states, the observations to be 
constantly equal to Y& = 6a s for every k > 110. We expect that the bootstrap filter, having a 
proposal transition kernel with constant mean m(x) = 0, will have a large mean square error 
(MSE) due a poor number of particles in regions where the likelihood is significant. We aim 
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at illustrating that the adaptive algorithms, whose transition kernels have the same mode 
as the optimal transition kernel, adjust automatically the variance of the proposals to that 
of the optimal kernel and reach performances comparable to that of the optimal auxiliary 
filter. 

For these observation records, Figure 1 displays MSEs estimates based on 500 filter means. 
Each filter used 5,000 particles. The reference values used for the MSE estimates were 
obtained using the optimal auxiliary particle filter with as many as 500,000 particles. This 
also provided a set from which the initial particles of every filter were drawn, hence allowing 
for initialisation at the filter distribution a few steps before the outlying observations. 

The CE-based filter of algorithm 5.2 was implemented in its most simple form, with the 
inside loop using a constant number of M l N = N/10 = 500 particles and only L = 5 iterations: 
a simple prefatory study of the model indicated that the Markov chain {0 e N }i>o stabilised 
around the value reached in the very first step. We set 9° N = 10 to avoid initialising at the 
optimal value. 

It can be seen in Figure 1(a) that using the CSD optimal weights combined with the 
prior kernel as proposal does not improve on the plain bootstrap filter, precisely because 
the observations were chosen in such a way that the prior kernel was helpless. On the 
contrary, Figures 1(a) and 1(b) show that the adaptive schemes perform exactly similarly 
to the optimal filter: they all success in finding the optimal scale of the standard deviation, 
and using uniform adjustment weights instead of optimal ones does not impact much. 

We observe clearly a change of regime, beginning at step 110, corresponding to the outlying 
constant observations. The adaptive filters recover from the changepoint in one timestep, 
whereas the bootstrap filter needs several. More important is that the adaptive filters (as 
well as the optimal one) reduce, in the regime of the outlying observations, the MSE of the 
bootstrap filter by a factor 10. 

Moreover, for a comparison with fixed simulation budget, we ran a bootstrap filter with 
3N = 15,000 particles This corresponds to the same simulation budget as the CE-based 
adaptive scheme with N particles, which is, in this setting, the fastest of our adaptive 
algorithms. In our setting, the CE-based filter is measured to expand the plain bootstrap 
runtime by a factor 3, although a basic study of algorithmic complexity shows that this 
factor should be closer to Y2e=i I 'N = 1-5 — the difference rises from Matlab benefitting 
from the vectorisation of the plain bootstrap filter, not from the iterative nature of the CE. 

The conclusion drawn from Figure 1(b) is that for an equal runtime, the adaptive filter 
outperforms, by a factor 3.5, the bootstrap filter using even three times more particles. 

Appendix A. Proofs 

A.l. Proof of Theorems 4.1 and 4.2. We preface the proofs of Theorems 4.1 and 4.2 
with the following two lemmata. 

Lemma A.l. Assume (A2). Then the following identities hold. 

i) <W//|K) = i/<g>L{tog[$i/(tt)/i/L(S)]}/i/L(E) , 
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100 105 110 115 120 125 

Time index 

(a) Auxiliary filter based on chi-square optimal 
weights 'J'* 2 q and prior kernel K (o), adaptive 
filters minimising the empirical KLD (*) and 
CSD (x), and reference filters listed below. 



40 




100 105 110 115 120 125 



Time index 

(b) CE-based adaption (A, dash-dotted line), 
bootstrap filter with 37V particles (□, dashed 
line), and reference filters listed below. 



Figure 1. Plot of MSE performances (on log-scale) on the ARCH model 
with (Po,Pi,cr%) = (1,0.99,10). Reference filters common to both plots are: 
the bootstrap filter (□, continuous line), the optimal filter with weights \&* 
and proposal kernel density r* (O), and a bootstrap filter using a proposal 
with parameter 9* N minimising the current KLD (A, continuous line). The 
MSE values are computed using N = 5,000 particles — except for the reference 
bootstrap using 3N particles (□, dashed line) — and 1,000 runs of each algo- 
rithm. 



n) d x 2 (fx*|| Til) = K*) v ® L($)/>L(H)] 2 - 1 . 

Proof. We denote by <7 (£,£') the Radon-Nikodym derivative of the probability measure /i* 
with respect to v ® R (where the outer product ® of a measure and a kernel is defined in 
(3.2)), that is, 

and by the Radon-Nikodym derivative of the probability measure 7r* with respect to 

v <g> R: 

P(« = ■ (A.2) 
Using the notation above and definition (4.1) of the weight function $, we have 

vL{3) q{£)uL{S) 
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This implies that 

= L{log[$i/(tt)/i/L(S)]}/j/L(S) , 
which establishes assertion i). Similarly, we may write 



d x 2 (fj,*\\ ttI) 



[vL{B)f 
u(^>)u®L(<$>)/[vL(B)} 2 - 1 , 



showing assertion ii). 



□ 



Lemma A.2. Assume (Al, A2) and lei C* = {/ G B(S x H) : L(-, \f\) G C n L 1 (S, i/)}. 
Then, for all f G C* ; as N — > oo ; 



^^^/(W^.^i) — > v®L(f)/vL(B) 



i=i 



Proof. It is enough to prove that 



^Af 1 ^ UJ N,if(^N,I Nti ,iN,; 



v®L{f)/v{y) 



(A.3) 



i=i 



for all / G C*; indeed, since the function / = 1 belongs to C* under (A2), the result 
of the lemma will follow from (A.3) by Slutsky's theorem. Define the measure tp(A) = 
z/(\l/l J 4)/z/(\I/), with A G S(3). By applying Theorem 1 in Douc and Moulines (2008) we 
conclude that the weighted sample {(6v,i, V , JV,i)}i=i is consistent for (<£>,{/ G L^S,^) : 
^\f \ G C}). Moreover, by Theorem 2 in the same paper this is also true for the uniformly 

weighted sample {(^N,i Ni , l)}i=T ( see the proof of Theorem 3.1 in Douc et al. (2008) for 
details). By definition, for / G C*, <p <g> R($\f\) = v ® L(|/|) < oo and $|/|) = 

L(-, |/|) G C. Hence, we conclude that i?(-, $|/|) and thus R(-, $/) belong to the proper set 
{/ G L X (S, </?) : \P|/| G C}. This implies the convergence 



M 



A/jv 
i=l 



N 



A/jv 



*f) 

+ <p® R($f) = v®L(f)/v(V), (A.4) 



i=l 
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where — a ({£N,i Ni }iLi) denotes the cx-algebra generated by the selected particles. It 
thus suffices to establish that 



M N 



M^^NE oj N>i f(^ N>lN .^ N>i ] 



N 



UN,if(t,N,I N ;,£,N,i) 



(A.5) 



and we do this, following the lines of the proof of Theorem 1 in Douc and Moulines (2008), 
by verifying the two conditions of Theorem 11 in the same work. The sequence 



M N 

i=i 



N 



N 



is tight since it tends to v ® L(\f\) / v(^>) in probability (cf. (A. 4)). Thus, the first condition 
is satisfied. To verify the second condition, take e > and consider, for any C > 0, the 
decomposition 



M N 



M^^E \cJN,i\f(UlN,viN,i)\t 



{&N,i\f(€N,i N v iN,i)\>e} 



N 



i=l 



M N M N 

eM N <C] 

i=l i=l 



N 



Since R(-,$f) belongs to the proper set {/ G L^S, cp) : ^|/| G C}, so does the function 
R(-, $|/|1{$|/| > C}). Thus, since the indicator t{eM^ < C} tends to zero, we conclude 
that the upper bound above has the limit <^®i?($|/|l{$|/| > C}); however, by dominated 
convergence this limit can be made arbitrarily small by increasing C. Hence 



M^E [^1/(^,1^)11^1^,^, 



f,i)\>e} 



i=l 



N 



which verifies the second condition of Theorem 11 in Douc and Moulines (2008). Thus, (A.5) 
follows. □ 

Proof of Theorem J,..!. We start with i). In the light of Lemma A.l we establish the limit 



cW/4 ux |K ux ) ^v®L{\og[$v(y)/vL(5)]}/vL(S) 



(A.6) 
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as iV — > oo. Hence, recall the definition (given in Section 4) of the KLD and write, for any 
index m G {1, . . . , M^}, 



M N 



i=l 



'N,m — 1> 



+ log 



t*rm >< s) 



Ylf=i ^n/L^nj, 3) 



, (A.7) 



where denotes the expectation associated with the random measure /x^ x . For each 

term of the sum in (A.7) we have 

u N ,iL log $) 



E , , aux 



log$(6v,J JV , m ,6v,m, 



Z^fcl u N,iL(^NJ, 3) 



and by using the consistency of { (£iv,i> k>iv,i) }j=i (under (Al)) we obtain the limit 



JV.m, 



t=l 



'JV,m — * 



/4 UX (W x 3) ^®L(log$)/z/L(3) 



where we used that L(-,\ log $]) G C by assumption, implying, since C is proper, L(-, log $) G 
C. Moreover, under (A2), by the continuous mapping theorem, 



log 



EMjv / 
3=1 u N,jWN,j 

Ylf=i vnjL^nj, 3) 



log[v(*)/vL(3)] 



yielding 



rf KL (^ ux |K lx ) X(log$)/vL(3) +logK*)/vL(3)] 

= i/ ® L{log[$v(^)/vL(3)]}/z/L(3) 

which establishes (A. 6) and, consequently, z). 
To prove it) we show that 



d^TWT) v ® L{<b)/[uL{*)} 2 - 1 



(Al 



and apply Lemma A.l. Thus, recall the definition of the CSD and write, for any index 
me{l,...,M N }, 



d/i 



aux 
N 



j aux 
N 



M N 



i=l 



d/i 



aux 
N 



J— aux 
N 



N,I N:rn ,C,N,m; 



l N,m 



»Tm x s) - 1 . 
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IK,, aux 



d/i^ ux 

d7r aux 



N,I N ,mi SAT.m, 



1 A ; 



'N,m — 1 



J'=l 



and using the consistency of {(6v,i, <^7v,i)}i=i yields the limit 



M N 
i=l 



J-aux 
N 



(.£,N,I Ntm ,£,N,m) 



l N,m 



fiT({i] x a) — > ® L($)/[ l /L(a)] 2 . 



which proves (A. 8). This completes the proof of ii). 



□ 



Proof of Theorem 1^.2. Applying directly Lemma A. 2 for / = log$ (which belongs to C* by 
assumption) and the limit (A. 3) for / = 1 yields, by the continuous mapping theorem, 



M N 



£(W,ih=l) = ^N^2^N,i log^i + log(MAriV; 



i=l 

v ® L(log $)/z/L(3) + log[i/(*)/i/L(E)] 

= v® L{log[$i/(W)/i/L(S)]}/i/L(E) . 

Now, we complete the proof of assertion i) by applying Lemma A.l. 

We turn to ii). Since $ belongs to C* by assumption, we obtain, by applying Lemma A. 2 
together with (A. 3), 



M N 



cv\{Co N ,}tt l ) = (M N n^)n^J2 



i=l 



From this ii) follows via Lemma A.l. 



Kkl(*) = v ® L($)/[z/L(H)] 2 - 1 . (A.9) 

□ 



A.2. Proof of Proposition 4.2. Define by q(£) = j^R(£,d£')q the marginal density 

of the measure on (3,£(3)), A e 5(3) ^ /i*(A x 3). We denote by q(£'\0 = g(f,OM0 
the conditional distribution. By the chain rule of the entropy, (the entropy of a pair of 
random variables is the entropy of one plus the conditional entropy of the other), we may 
split the KLD between fi* and it* as follows, 

<W/4TIK UX ) = ^KdO?(Oi°g(p -1 (£)<z(£)) + // ^dO^,dO<z(£,0 iogg(£|£') . 
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The second term in the RHS of the previous equation does not depend on the adjustment 
multiplier weight The first term is canceled if we set p — q, i.e. if 



„(¥) J* - / E „(d£)L(£,H) 



which establishes assertion z). 

Consider now assertion wj. Note first that 



"(dOp~W(o-i 



Ag) { 1 Kd °Rpfe " 1 K ^ ~ 1 ' (A - 10) 



where 



g 2 & = J^R^dOq 2 ^?) 



The first term on the RHS of (A.IO) is the CSD between the probability distributions 
associated with the densities g/v(g) and ^ with respect to v. The second term does 
not depend on \I/ and the optimal value of the adjustment multiplier weight is obtained by 
canceling the first term. This establishes assertion ii). 
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